On January 12, 2010, a magnitude 7.0 earthquake struck Haiti causing significant damage, which affected approximately 3 million citizens. In the wake of the disaster, aid groups were working to locate displaced persons and provide them with food and water. However, due to the large scale destruction of infrastructure over a wide area additional assistance was needed to locate people quickly.
Little is left of a neighborhood on a hillside near downtown Port-au-Prince on Jan. 15. More than a million people were displaced by the quake. (David Gilkey/NPR)
Displaced persons created make-shift shelters out of blue tarps. In order to locate displaced persons quickly, high resolution geo-refereneced images were captured by aircraft of the destroyed areas. The data generated by the image collection was too large for aid workers to process in time to supply aid. Therefore, a team from the Rochester Institute of Technology used data-mining algorithms to analyze the images and identify blue tarps. The goal was to effectively locate displaced persons as quickly as possible and communicate their location to rescue workers so they could deliver resources to people in time.
Sample image of a geo-referenced image used for the analysis
As the final project for SYS 6018 - Data Mining, we were assigned to use techniques we learned in the course to build models that would as accurately as possible, and in as timely a manner as possible, locate the greatest number of the displaced persons from the provided imagery data. The data made available to students consisted of a csv of red, green, blue pixel values and a class indicator which indicated if a pixel was representative of a blue tarp or something else like vegetation.
We were also provided multiple text files that contained data and extra information to be used as a final hold out test set. We were expected to wrangle the data into a usable format.
The US Government spent $1.5B on Haiti disaster relief by the end of 2010. For this project, we will assume that $2,000,000 was allocated to our team to deliver supplies to displaced individuals in the immediate aftermath. Our team has been assigned an area where 2,022 displaced people are believed to be (this is the number of blue tarps are in our training data set). Anything less than a 80% delivery success rate will be considered a disaster relief failure. 80% of 2,022 people is 1,618. Therefore, if we fail to locate 404 of our blue tarps our mission would be considered a failure.
These considerations will help guide the selection of thresholds later in the analysis.
| Budget | $750,000 |
| Cost per Delivery (True Positive) | -$750 |
| Cost per Mis-Delivery (False Positive) | -$300 |
| Cost per Missed Delivery (False Negative) | -$4,946 |
| Cost per True Negative | $0 |
tic("Run Time")
The data provided for analysis was generated from overhead images and stored as a three channel output. Each pixel also had a classifier label indicating whether it was a blue tarp or something else like vegetation or soil. The channels represented the red, green, and blue values for pixels within images. RGB color model is referred to as an additive model. The integer value for the red, green, and blue channels are combined to represent a color. Typically, the component values are stored as an 8 bit integer ranging from 0 to 255.
The data was visualized with the ggpairs function. For a pair of variables chosen from the data frame, Ggpairs generates a scatterplot, displays a Pearson correlation, and, on the diagonal, shows a variable distribution. The plots were also color-coded by class. The class label describes what kind of object a pixel is associated with. In our data frame there were the following classes: Blue Tarp, Rooftop, Soil, Various Non-tarp, and Vegetation. The 2D representation of the data only gives us a partial insight into the behavior and relationships of the predictors. Since three channels are used to generate a color, plotting the data in 3D to investigate trends and behavior between classes will be more meaningful.
The 3D scatter plot shows a significant amount of overlap between the different classes. It is worth noting that it is possible to see some separation between the classes.
df <- tibble(read.csv("HaitiPixels.csv")) #read in df
"Check for NA values"
anyNA(df) #check for NA values
"Summary of Data"
summary(df) #quick look at data
df$Class <- factor(df$Class) #make Class a factor variable.
#> [1] "Check for NA values"
#> [1] FALSE
#> [1] "Summary of Data"
#> Class Red Green Blue
#> Length:63241 Min. : 48 Min. : 48.0 Min. : 44.0
#> Class :character 1st Qu.: 80 1st Qu.: 78.0 1st Qu.: 63.0
#> Mode :character Median :163 Median :148.0 Median :123.0
#> Mean :163 Mean :153.7 Mean :125.1
#> 3rd Qu.:255 3rd Qu.:226.0 3rd Qu.:181.0
#> Max. :255 Max. :255.0 Max. :255.0
We can see from the output that there aren’t any NA values that need to be removed or adjusted. The values in each predictor column fall within the expected 0 to 255 range.
#Reference [1]
# The palette with grey:
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#CC79A7", "#0072B2", "#D55E00")
#"#F0E442",
#view scatter and correlations
p <- ggpairs(df[,2:4], lower.panel = NULL, upper = list(continuous = wrap("cor", size = 3)), aes(color=df$Class)) #+ scale_fill_manual(values=cbPalette)
#Reference: https://stackoverflow.com/questions/34740210/how-to-change-the-color-palette-for-ggallyggpairs/34743555
for(i in 1:p$nrow) {
for(j in 1:p$ncol){
p[i,j] <- p[i,j] +
scale_fill_manual(values=cbPalette) +
scale_color_manual(values=cbPalette)
}
}
p
attach(df) #attach df variables
fig <- plot_ly(df, x=~Red, y=~Green, z=~Blue, color=~Class, colors=c("#999999", "#E69F00", "#56B4E9", "#009E73", "#CC79A7")) #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig <- fig %>% add_markers()
fig <- fig %>% layout(scene=list(xaxis=list(title="Red"),
yaxis = list(title = 'Green'),
zaxis = list(title = 'Blue')))
fig
As noted in the previous section, the data provided is sufficiently cleaned only one further adjustment to the data frame is needed. Since our main interest is to predict whether a pixel represents a blue tarp or not a blue tarp, the Class column of the data frame needs to be converted into a binary indicator for blue tarp or not blue tarp.
df <- cbind(mutate(df, "Blue_Tarp_or_Not"=ifelse(Class != "Blue Tarp", 0, 1))) #add binary column indicating whether the Class variable is "Blue Tarp" or not
attach(df)
df$Blue_Tarp_or_Not <- factor(Blue_Tarp_or_Not, labels = c("NBT", "BT"))#, levels =c(0,1), labels = c("NBT", "BT")) #ensure new column is a factor
"First Six Rows of Data Frame"
head(df)
df_factor <- df[, -1]
"Last Six Rows of Data Frame"
tail(df_factor)
attach(df_factor)
#> [1] "First Six Rows of Data Frame"
#> Class Red Green Blue Blue_Tarp_or_Not
#> 1 Vegetation 64 67 50 NBT
#> 2 Vegetation 64 67 50 NBT
#> 3 Vegetation 64 66 49 NBT
#> 4 Vegetation 75 82 53 NBT
#> 5 Vegetation 74 82 54 NBT
#> 6 Vegetation 72 76 52 NBT
#> [1] "Last Six Rows of Data Frame"
#> Red Green Blue Blue_Tarp_or_Not
#> 63236 136 145 150 BT
#> 63237 138 146 150 BT
#> 63238 134 141 152 BT
#> 63239 136 143 151 BT
#> 63240 132 139 149 BT
#> 63241 133 141 153 BT
fig1 <- plot_ly(df_factor, x=~Red, y=~Green, z=~Blue, color=~Blue_Tarp_or_Not, colors = cbPalette) #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig1 <- fig1 %>% add_markers()
fig1 <- fig1 %>% layout(scene=list(xaxis=list(title="Red"),
yaxis = list(title = 'Green'),
zaxis = list(title = 'Blue')))
fig1
After the class label is converted into a binary classifier, it is easier to see separation between the data points for blue tarps and not blue tarps.
A few general notes on the various metrics and considerations that will be explored for each model.
ROC
The receiver operating characteristic or ROC is a graphic which illustrates the true positive percentage and the false positive percentage produced by the fit model at different threshold values. Generally, we seek to maximize the true positive rate (TPR) and minimize the false positive rate (FPR). For this project, are interested in maximizing the TPR and minimizing the FPR. However, because we are creating models that will assist in saving lives we are very concerned with the false negative rate. We want to avoid missed blue tarps, as best we can with the budget we’re allowed. Due to the imbalance in the ratio of our BT and NBT observations, we must be concerned with the FPs (NBTs flagged as BTs) depleting our available resources before we are able to find all the blue tarps (TP).
AUC
The area under the curve or AUC is a metric that indicates how well a model can differentiate between two classes.The higher the AUC the better the model is at predicting class labels correctly.
Thresholds
It can be difficult to determine what threshold to choose for your model from just an ROC curve alone. To aid in threshold selection I decided to graph the TPR, FPR, and threshold in a 3-D scatterplot. As you will see from the interactive graphics, it’s much easier to determine which threshold will give you the model performance that is appropriate for your application. In our case, our threshold selection is guided by the aim of maximizing the TPR while minimizing the FPR and ensuring that our operation does not exhaust our budget by pursuing incorrectly labeled not blue tarps (false positives) before we reach our success criterion.
Confusion Matrix
The confusion matrix displays the true negatives, false negatives, false positives, and true positives giving a sense of the performance of a model. These values are used to calculate metrics which are commonly used to compare the performance of models such as the TPR (also known as sensitivity or recall), FPR, FNR, true negative rate (specificity), precision, and many more.
Sampling Variability
Understanding the confidence of the model you’ve created is critical to understanding how well your model will perform when given a new data set with a different values. How much variability can you expect from your model?
Logistic regression is classification technique, which, for our purposes, will be used to perform binary classification. Our logistic regression model calculates the probability that an observation is a blue tarp. The method of maximum likelihood is used to calculate the log likelihood for a prediction which is converted to a probability. Whether these probabilities are classified as a blue tarp or not a blue tarp is dependent on the threshold value chosen. A probability above the threshold value is assigned a blue tarp label and a probability below the threshold is assigned a not blue tarp label.
#pass
fitControl <- trainControl(method = "cv",
number = 10,
returnResamp = 'all',
savePredictions = 'final',
classProbs = TRUE)
set.seed(4)
tic("Log. Reg.")
glm.fit <- caret::train(Blue_Tarp_or_Not~Red+Green+Blue,
data = df_factor,
method="glm",
family="binomial",
trControl= fitControl)
glm.time <- toc(quiet = TRUE)
glm.fit
"Summary"
summary(glm.fit)
#> Generalized Linear Model
#>
#> 63241 samples
#> 3 predictor
#> 2 classes: 'NBT', 'BT'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 56917, 56917, 56917, 56916, 56917, 56917, ...
#> Resampling results:
#>
#> Accuracy Kappa
#> 0.9952879 0.9207043
#>
#> [1] "Summary"
#>
#> Call:
#> NULL
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -3.4266 -0.0205 -0.0015 0.0000 3.7911
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.20984 0.18455 1.137 0.256
#> Red -0.26031 0.01262 -20.632 <2e-16 ***
#> Green -0.21831 0.01330 -16.416 <2e-16 ***
#> Blue 0.47241 0.01548 30.511 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 17901.6 on 63240 degrees of freedom
#> Residual deviance: 1769.5 on 63237 degrees of freedom
#> AIC: 1777.5
#>
#> Number of Fisher Scoring iterations: 12
The model fit summary output indicates that the three predictors all have significant p-values, meaning that they all are contributing to the model. The accuracy output indicates that over ten folds the average accuracy was 99.5%. Further discussion of the variability over ten folds can be found in the Sampling Variability tab.
#pass
glm.prob <- predict(glm.fit, newdata=df_factor , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
par(pty="s")
glm_roc <- roc(df_factor $Blue_Tarp_or_Not, glm.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="Log. Reg. ROC Curve")
An AUC of 99.9% indicates that our model is able to distinguish between the classes almost perfectly. It should be possible to choose a threshold that can identify most of the training blue tarps without going over our budget.
roc.info_glm <- roc(df_factor$Blue_Tarp_or_Not, glm.prob[,2], legacy.axes=TRUE)
roc.glm.df <- data.frame(tpp=roc.info_glm$sensitivities*100, fpp=(1-roc.info_glm$specificities)*100, thresholds=roc.info_glm$thresholds)
#roc.glm.df[roc.glm.df>98.5 & roc.glm.df < 99,]
glm.thresholds <- data.matrix(roc.glm.df$thresholds)
fig2 <- plot_ly(roc.glm.df, x=~tpp, y=~fpp, z=~thresholds, colors = "#0072B2") #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig2 <- fig2 %>% add_markers()
fig2 <- fig2 %>% layout(scene=list(xaxis=list(title="True Positive Rate"),
yaxis = list(title = 'False Positive Rate'),
zaxis = list(title = 'Threshold')))
fig2
lr.thresh <- 0.03265
glm.pred_thresh <- factor(ifelse(glm.prob[,2]>lr.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.glm_thresh <- confusionMatrix(factor(glm.pred_thresh),df_factor $Blue_Tarp_or_Not, positive = "BT")
"Threshold:"
lr.thresh
cm.glm_thresh
acc_LR <- cm.glm_thresh[["overall"]][["Accuracy"]]*100
auc_LR <- glm_roc[["auc"]]
thresh_LR <- lr.thresh
sens_LR <- cm.glm_thresh[["byClass"]][["Sensitivity"]]*100
spec_LR <- cm.glm_thresh[["byClass"]][["Specificity"]]*100
FDR_LR <- ((cm.glm_thresh[["table"]][2,1])/(cm.glm_thresh[["table"]][2,1]+cm.glm_thresh[["table"]][2,2]))*100
prec_LR <- cm.glm_thresh[["byClass"]][["Precision"]]*100
F.glm <- round(2*((prec_LR*sens_LR)/(prec_LR+sens_LR))/100, 2)
cost.glm <- round(cm.glm_thresh[["table"]][1]*0 + cm.glm_thresh[["table"]][2]*FP_C + cm.glm_thresh[["table"]][3]*FN_C + cm.glm_thresh[["table"]][4]*TP_C, 4)
cost.glm
#> [1] "Threshold:"
#> [1] 0.03265
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 60171 40
#> BT 1048 1982
#>
#> Accuracy : 0.9828
#> 95% CI : (0.9818, 0.9838)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.7761
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.98022
#> Specificity : 0.98288
#> Pos Pred Value : 0.65413
#> Neg Pred Value : 0.99934
#> Prevalence : 0.03197
#> Detection Rate : 0.03134
#> Detection Prevalence : 0.04791
#> Balanced Accuracy : 0.98155
#>
#> 'Positive' Class : BT
#>
#> [1] -1998740
A threshold of 0.03265 was selected for this model.
"10 Fold Results"
glm.fit$resample #point est +/- std from 10 folds "variation in the third decimal place"...
glm.sd <- sd(glm.fit[["resample"]][["Accuracy"]]*100)
#plot(glm.fit[["resample"]][["Accuracy"]], main="Accuracy per Fold", xlab= "Fold Number", ylab="Accuracy")
#> [1] "10 Fold Results"
#> Accuracy Kappa parameter Resample
#> 1 0.9955724 0.9259239 none Fold01
#> 2 0.9963631 0.9393021 none Fold02
#> 3 0.9950980 0.9169589 none Fold03
#> 4 0.9943083 0.9038085 none Fold04
#> 5 0.9950980 0.9177836 none Fold05
#> 6 0.9960468 0.9340241 none Fold06
#> 7 0.9944655 0.9052940 none Fold07
#> 8 0.9955724 0.9262890 none Fold08
#> 9 0.9954143 0.9223164 none Fold09
#> 10 0.9949399 0.9153427 none Fold10
The average accuracy across ten folds is 98.28 with a standard deviation of 0.064. Based on the k folds accuracy performance, which exhibits variation in the thousands place, we can have confidence in the model performing well even when there is some variation in the data. However, if another data set is significantly different than the training data set there could be a considerable change in performance.
LDA is a classification method that uses the maximum likelihood method to classify observations. LDA assumes each class has a normal distribution and that every covariance is the same.This creates a linear separation boundary between classes. The observation is assigned to the class with the highest calculated likelihood. Caret automatically does the calculation to convert the likelihood to a probability, which is what we use in conjunction with our chosen threshold to classify our data.
#pass
fitControl <- trainControl(method = "cv",
number = 10,
returnResamp = 'all',
savePredictions = 'final',
classProbs = TRUE)
set.seed(4)
tic("LDA")
lda.fit <- caret::train(Blue_Tarp_or_Not~Red+Green+Blue,
data = df_factor,
preProcess=c("center","scale"),
method="lda",
verbose= FALSE,
trControl= fitControl)
lda.time <- toc(quiet = TRUE)
lda.fit
#> Linear Discriminant Analysis
#>
#> 63241 samples
#> 3 predictor
#> 2 classes: 'NBT', 'BT'
#>
#> Pre-processing: centered (3), scaled (3)
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 56917, 56917, 56917, 56916, 56917, 56917, ...
#> Resampling results:
#>
#> Accuracy Kappa
#> 0.983887 0.7524541
#pass
lda.prob <- predict(lda.fit, newdata=df_factor, type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
par(pty="s")
lda_roc <- roc(df_factor$Blue_Tarp_or_Not, lda.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="LDA ROC Curve")
An AUC of 98.9% indicates that our model is able to distinguish between the classes pretty well. With a higher false positive percentage where the true positive percentage is near 100% it will likely be more difficult to choose a threshold that can identify most of the training blue tarps without going over our budget.
roc.info_lda <- roc(df_factor $Blue_Tarp_or_Not, lda.prob[,2], legacy.axes=TRUE)
roc.lda.df <- data.frame(tpp=roc.info_lda$sensitivities*100, fpp=(1-roc.info_lda$specificities)*100, thresholds=roc.info_lda$thresholds)
#roc.lda.df[roc.lda.df>91.5 & roc.lda.df < 91.6,]
fig3 <- plot_ly(roc.lda.df, x=~tpp, y=~fpp, z=~thresholds, colors = "#0072B2") #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig3 <- fig3 %>% add_markers()
fig3 <- fig3 %>% layout(scene=list(xaxis=list(title="True Positive Rate"),
yaxis = list(title = 'False Positive Rate'),
zaxis = list(title = 'Threshold')))
fig3
lda.thresh <- 0.01
lda.pred_thresh <- factor(ifelse(lda.prob[,2]>lda.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.lda_thresh <- confusionMatrix(factor(lda.pred_thresh),df_factor$Blue_Tarp_or_Not, positive = "BT")
"Threshold:"
lda.thresh
cm.lda_thresh
acc_lda <- cm.lda_thresh[["overall"]][["Accuracy"]]*100
auc_lda <- lda_roc[["auc"]]
thresh_lda <- lda.thresh
sens_lda <- cm.lda_thresh[["byClass"]][["Sensitivity"]]*100
spec_lda <- cm.lda_thresh[["byClass"]][["Specificity"]]*100
FDR_lda <- ((cm.lda_thresh[["table"]][2,1])/(cm.lda_thresh[["table"]][2,1]+cm.lda_thresh[["table"]][2,2]))*100
prec_lda <- cm.lda_thresh[["byClass"]][["Precision"]]*100
F.lda <- round(2*((prec_lda*sens_lda)/(prec_lda+sens_lda))/100, 2)
cost.lda <- round(cm.lda_thresh[["table"]][1]*0 + cm.lda_thresh[["table"]][2]*FP_C + cm.lda_thresh[["table"]][3]*FN_C + cm.lda_thresh[["table"]][4]*TP_C, 4)
cost.lda
#> [1] "Threshold:"
#> [1] 0.01
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 59944 231
#> BT 1275 1791
#>
#> Accuracy : 0.9762
#> 95% CI : (0.975, 0.9774)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.6921
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.88576
#> Specificity : 0.97917
#> Pos Pred Value : 0.58415
#> Neg Pred Value : 0.99616
#> Prevalence : 0.03197
#> Detection Rate : 0.02832
#> Detection Prevalence : 0.04848
#> Balanced Accuracy : 0.93246
#>
#> 'Positive' Class : BT
#>
#> [1] -2868276
Based on the performance of the LDA model, I was unable to find a threshold that satisfied the budget. Therefore, I chose a threshold that minimized cost and number of false negatives.
"10 Fold Results"
lda.fit$resample
lda.sd <- sd(lda.fit[["resample"]][["Accuracy"]]*100)
#> [1] "10 Fold Results"
#> Accuracy Kappa parameter Resample
#> 1 0.9857685 0.7793824 none Fold01
#> 2 0.9854522 0.7824666 none Fold02
#> 3 0.9829222 0.7328063 none Fold03
#> 4 0.9845059 0.7630316 none Fold04
#> 5 0.9807084 0.7153126 none Fold05
#> 6 0.9843454 0.7632954 none Fold06
#> 7 0.9802340 0.6914626 none Fold07
#> 8 0.9867173 0.7912156 none Fold08
#> 9 0.9840291 0.7507018 none Fold09
#> 10 0.9841872 0.7548660 none Fold10
The average accuracy across ten folds is 97.62 with a standard deviation of 0.208. Based on the k folds accuracy performance, which exhibits variation in the thousands place, we can have confidence in the model performing well even when there is some variation in the data. However, if another data set is significantly different than the training data set there could be a considerable change in performance.
QDA is a classification method that uses the maximum likelihood method to classify observations. QDA assumes each class has a normal distribution. Each class has its own mean and covariance. The observation is assigned to the class with the highest calculated likelihood. Caret automatically does the calculation to convert the likelihood to a probability, which is what we use in conjunction with our chosen threshold to classify our data.
The decision boundary created by QDA is quadratic and allows for different covariances for the classes. Therefore, you will see better performance of a QDA model compared to a LDA model when the boundary between classes of data is non-linear and the classes have different covariance values.
#pass
fitControl <- trainControl(method = "cv",
number = 10,
returnResamp = 'all',
savePredictions = 'final',
classProbs = TRUE)
set.seed(4)
tic("QDA")
qda.fit <- caret::train(Blue_Tarp_or_Not~Red+Green+Blue,
data = df_factor,
preProcess=c("center","scale"),
method="qda",
verbose= FALSE,
trControl= fitControl)
qda.time <- toc(quiet = TRUE)
qda.fit
#> Quadratic Discriminant Analysis
#>
#> 63241 samples
#> 3 predictor
#> 2 classes: 'NBT', 'BT'
#>
#> Pre-processing: centered (3), scaled (3)
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 56917, 56917, 56917, 56916, 56917, 56917, ...
#> Resampling results:
#>
#> Accuracy Kappa
#> 0.9945763 0.9051517
#pass
qda.prob <- predict(qda.fit, newdata=df_factor , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
par(pty="s")
qda_roc <- roc(df_factor $Blue_Tarp_or_Not, qda.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="QDA ROC Curve")
An AUC of 99.8% indicates that our model is able to distinguish between the classes very well. It should be possible to choose a threshold that can identify most of the training blue tarps without going over our budget.
roc.info_qda <- roc(df_factor$Blue_Tarp_or_Not, qda.prob[,2], legacy.axes=TRUE)
roc.qda.df <- data.frame(tpp=roc.info_qda$sensitivities*100, fpp=(1-roc.info_qda$specificities)*100, thresholds=roc.info_qda$thresholds)
#roc.qda.df[roc.qda.df>98 & roc.qda.df < 99,]
fig4 <- plot_ly(roc.qda.df, x=~tpp, y=~fpp, z=~thresholds, colors = "#0072B2") #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig4 <- fig4 %>% add_markers()
fig4 <- fig4 %>% layout(scene=list(xaxis=list(title="True Positive Rate"),
yaxis = list(title = 'False Positive Rate'),
zaxis = list(title = 'Threshold')))
fig4
qda.thresh <- 0.02
qda.pred_thresh <- factor(ifelse(qda.prob[,2]>qda.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.qda_thresh <- confusionMatrix(factor(qda.pred_thresh),df_factor $Blue_Tarp_or_Not, positive = "BT")
"Threshold:"
qda.thresh
cm.qda_thresh
acc_qda <- cm.qda_thresh[["overall"]][["Accuracy"]]*100
auc_qda <- qda_roc[["auc"]]
thresh_qda <- qda.thresh
sens_qda <- cm.qda_thresh[["byClass"]][["Sensitivity"]]*100
spec_qda <- cm.qda_thresh[["byClass"]][["Specificity"]]*100
FDR_qda <- ((cm.qda_thresh[["table"]][2,1])/(cm.qda_thresh[["table"]][2,1]+cm.qda_thresh[["table"]][2,2]))*100
prec_qda <- cm.qda_thresh[["byClass"]][["Precision"]]*100
F.qda <- round(2*((prec_qda*sens_qda)/(prec_qda+sens_qda))/100,2)
cost.qda <- round(cm.qda_thresh[["table"]][1]*0 + cm.qda_thresh[["table"]][2]*FP_C + cm.qda_thresh[["table"]][3]*FN_C + cm.qda_thresh[["table"]][4]*TP_C, 4)
cost.qda
#> [1] "Threshold:"
#> [1] 0.02
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 59807 14
#> BT 1412 2008
#>
#> Accuracy : 0.9775
#> 95% CI : (0.9763, 0.9786)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.727
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.99308
#> Specificity : 0.97694
#> Pos Pred Value : 0.58713
#> Neg Pred Value : 0.99977
#> Prevalence : 0.03197
#> Detection Rate : 0.03175
#> Detection Prevalence : 0.05408
#> Balanced Accuracy : 0.98501
#>
#> 'Positive' Class : BT
#>
#> [1] -1998844
"10 Fold Results"
qda.fit$resample #point est +/- std from 10 folds "variation in the third decimal place"...
qda.sd <- sd(qda.fit[["resample"]][["Accuracy"]]*100)
#> [1] "10 Fold Results"
#> Accuracy Kappa parameter Resample
#> 1 0.9954143 0.9203183 none Fold01
#> 2 0.9958887 0.9298247 none Fold02
#> 3 0.9938330 0.8917279 none Fold03
#> 4 0.9941502 0.8978148 none Fold04
#> 5 0.9938330 0.8911622 none Fold05
#> 6 0.9954143 0.9215293 none Fold06
#> 7 0.9928843 0.8710492 none Fold07
#> 8 0.9957306 0.9269411 none Fold08
#> 9 0.9947818 0.9088589 none Fold09
#> 10 0.9938330 0.8922908 none Fold10
The average accuracy across ten folds is 97.75 with a standard deviation of 0.101. Based on the k folds accuracy performance, which exhibits variation in the thousands place, we can have confidence in the model performing well even when there is some variation in the data. However, if another data set is significantly different than the training data set there could be a considerable change in performance.
K nearest neighbors is a modeling method that calculates the probability an observation belongs to one class or another based on the class of its k nearest neighbors (observations closest in data space to the observation we’re trying to classify) are assigned to. K values of 5 or 10 are often chosen to avoid too much bias (happens when k is too big) and too much variance (happens when K is too small).
#pass
fitControl <- trainControl(method = "cv",
number = 10,
returnResamp = 'all',
savePredictions = 'final',
classProbs = TRUE)
set.seed(4)
tic("KNN")
knn.fit <- train(Blue_Tarp_or_Not~Red+Green+Blue,
data = df_factor,
preProcess=c("center","scale"),
method="knn",
trControl= fitControl,
tuneLength=10
)
knn.time <- toc(quiet = TRUE)
knn.fit
#> k-Nearest Neighbors
#>
#> 63241 samples
#> 3 predictor
#> 2 classes: 'NBT', 'BT'
#>
#> Pre-processing: centered (3), scaled (3)
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 56917, 56917, 56917, 56916, 56917, 56917, ...
#> Resampling results across tuning parameters:
#>
#> k Accuracy Kappa
#> 5 0.9972328 0.9555000
#> 7 0.9973435 0.9573825
#> 9 0.9972803 0.9563272
#> 11 0.9971221 0.9537932
#> 13 0.9972170 0.9552906
#> 15 0.9971379 0.9540331
#> 17 0.9970431 0.9524698
#> 19 0.9969956 0.9517428
#> 21 0.9969956 0.9516839
#> 23 0.9969482 0.9509483
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was k = 7.
The model fit indicates that the highest accuracy is obtained when k is 7.
plot(knn.fit)
#pass
knn.prob <- predict(knn.fit, newdata=df_factor , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
par(pty="s")
knn_roc <- roc(df_factor $Blue_Tarp_or_Not, knn.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="KNN ROC Curve")
An AUC of 100.0% indicates that our model is able to distinguish between the classes perfectly. It should be possible to choose a threshold that can identify all of the training blue tarps without going over our budget.
roc.info_knn <- roc(df_factor$Blue_Tarp_or_Not, knn.prob[,2], legacy.axes=TRUE)
roc.knn.df <- data.frame(tpp=roc.info_knn$sensitivities*100, fpp=(1-roc.info_knn$specificities)*100, thresholds=roc.info_knn$thresholds)
#roc.knn.df[roc.knn.df>99 & roc.knn.df < 100,]
#roc.knn.df
fig5 <- plot_ly(roc.knn.df, x=~tpp, y=~fpp, z=~thresholds, colors = "#0072B2") #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig5 <- fig5 %>% add_markers()
fig5 <- fig5 %>% layout(scene=list(xaxis=list(title="True Positive Rate"),
yaxis = list(title = 'False Positive Rate'),
zaxis = list(title = 'Threshold')))
fig5
knn.thresh <- 0.07
knn.pred_thresh <- factor(ifelse(knn.prob[,2]>knn.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.knn_thresh <- confusionMatrix(factor(knn.pred_thresh),df_factor $Blue_Tarp_or_Not, positive = "BT")
"Threshold:"
knn.thresh
cm.knn_thresh
acc_knn <- cm.knn_thresh[["overall"]][["Accuracy"]]*100
auc_knn <- knn_roc[["auc"]]
thresh_knn <- knn.thresh
sens_knn <- cm.knn_thresh[["byClass"]][["Sensitivity"]]*100
spec_knn <- cm.knn_thresh[["byClass"]][["Specificity"]]*100
FDR_knn <- ((cm.knn_thresh[["table"]][2,1])/(cm.knn_thresh[["table"]][2,1]+cm.knn_thresh[["table"]][2,2]))*100
prec_knn <- cm.knn_thresh[["byClass"]][["Precision"]]*100
k_knn <- knn.fit[["bestTune"]][["k"]]
F.knn <- round(2*((prec_knn*sens_knn)/(prec_knn+sens_knn))/100,2)
cost.knn <- round(cm.knn_thresh[["table"]][1]*0 + cm.knn_thresh[["table"]][2]*FP_C + cm.knn_thresh[["table"]][3]*FN_C + cm.knn_thresh[["table"]][4]*TP_C,4)
cost.knn
#> [1] "Threshold:"
#> [1] 0.07
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 60685 0
#> BT 534 2022
#>
#> Accuracy : 0.9916
#> 95% CI : (0.9908, 0.9923)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.879
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 1.00000
#> Specificity : 0.99128
#> Pos Pred Value : 0.79108
#> Neg Pred Value : 1.00000
#> Prevalence : 0.03197
#> Detection Rate : 0.03197
#> Detection Prevalence : 0.04042
#> Balanced Accuracy : 0.99564
#>
#> 'Positive' Class : BT
#>
#> [1] -1676700
"10 Fold Results"
knn.fit$resample #point est +/- std from 10 folds "variation in the third decimal place"...
knn.sd <- sd(knn.fit[["resample"]][["Accuracy"]]*100)
#> [1] "10 Fold Results"
#> Accuracy Kappa k Resample
#> 1 0.9977862 0.9638571 5 Fold01
#> 2 0.9974700 0.9586938 7 Fold01
#> 3 0.9974700 0.9586938 9 Fold01
#> 4 0.9976281 0.9613688 11 Fold01
#> 5 0.9977862 0.9638571 13 Fold01
#> 6 0.9977862 0.9638571 15 Fold01
#> 7 0.9976281 0.9611816 17 Fold01
#> 8 0.9976281 0.9611816 19 Fold01
#> 9 0.9976281 0.9611816 21 Fold01
#> 10 0.9976281 0.9611816 23 Fold01
#> 11 0.9979443 0.9668394 5 Fold02
#> 12 0.9973118 0.9570464 7 Fold02
#> 13 0.9974700 0.9594773 9 Fold02
#> 14 0.9974700 0.9594773 11 Fold02
#> 15 0.9973118 0.9568423 13 Fold02
#> 16 0.9971537 0.9544119 15 Fold02
#> 17 0.9971537 0.9544119 17 Fold02
#> 18 0.9973118 0.9568423 19 Fold02
#> 19 0.9973118 0.9566362 21 Fold02
#> 20 0.9973118 0.9566362 23 Fold02
#> 21 0.9974700 0.9598579 5 Fold03
#> 22 0.9968374 0.9500570 7 Fold03
#> 23 0.9968374 0.9495856 9 Fold03
#> 24 0.9973118 0.9570464 11 Fold03
#> 25 0.9974700 0.9596685 13 Fold03
#> 26 0.9974700 0.9596685 15 Fold03
#> 27 0.9971537 0.9546271 17 Fold03
#> 28 0.9974700 0.9596685 19 Fold03
#> 29 0.9976281 0.9620998 21 Fold03
#> 30 0.9971537 0.9548402 23 Fold03
#> 31 0.9965217 0.9437481 5 Fold04
#> 32 0.9971542 0.9544120 7 Fold04
#> 33 0.9968379 0.9493467 9 Fold04
#> 34 0.9965217 0.9442814 11 Fold04
#> 35 0.9969960 0.9517651 13 Fold04
#> 36 0.9966798 0.9466877 15 Fold04
#> 37 0.9966798 0.9466877 17 Fold04
#> 38 0.9965217 0.9442814 19 Fold04
#> 39 0.9963636 0.9418865 21 Fold04
#> 40 0.9962055 0.9392161 23 Fold04
#> 41 0.9976281 0.9613688 5 Fold05
#> 42 0.9977862 0.9642031 7 Fold05
#> 43 0.9976281 0.9613688 9 Fold05
#> 44 0.9973118 0.9562180 11 Fold05
#> 45 0.9974700 0.9588925 13 Fold05
#> 46 0.9974700 0.9588925 15 Fold05
#> 47 0.9974700 0.9588925 17 Fold05
#> 48 0.9973118 0.9562180 19 Fold05
#> 49 0.9974700 0.9588925 21 Fold05
#> 50 0.9974700 0.9588925 23 Fold05
#> 51 0.9974700 0.9588925 5 Fold06
#> 52 0.9979443 0.9666803 7 Fold06
#> 53 0.9974700 0.9590893 9 Fold06
#> 54 0.9973118 0.9562180 11 Fold06
#> 55 0.9971537 0.9537541 13 Fold06
#> 56 0.9974700 0.9588925 15 Fold06
#> 57 0.9973118 0.9564281 17 Fold06
#> 58 0.9973118 0.9564281 19 Fold06
#> 59 0.9974700 0.9588925 21 Fold06
#> 60 0.9973118 0.9564281 23 Fold06
#> 61 0.9971537 0.9537541 5 Fold07
#> 62 0.9977862 0.9638571 7 Fold07
#> 63 0.9977862 0.9640309 9 Fold07
#> 64 0.9974700 0.9588925 11 Fold07
#> 65 0.9976281 0.9615542 13 Fold07
#> 66 0.9971537 0.9537541 15 Fold07
#> 67 0.9973118 0.9564281 17 Fold07
#> 68 0.9971537 0.9537541 19 Fold07
#> 69 0.9973118 0.9564281 21 Fold07
#> 70 0.9973118 0.9564281 23 Fold07
#> 71 0.9976281 0.9622782 5 Fold08
#> 72 0.9976281 0.9622782 7 Fold08
#> 73 0.9974700 0.9598579 9 Fold08
#> 74 0.9973118 0.9574490 11 Fold08
#> 75 0.9974700 0.9598579 13 Fold08
#> 76 0.9974700 0.9598579 15 Fold08
#> 77 0.9973118 0.9570464 17 Fold08
#> 78 0.9968374 0.9498224 19 Fold08
#> 79 0.9968374 0.9495856 21 Fold08
#> 80 0.9969956 0.9519931 23 Fold08
#> 81 0.9966793 0.9466875 5 Fold09
#> 82 0.9969956 0.9515345 7 Fold09
#> 83 0.9969956 0.9517649 9 Fold09
#> 84 0.9965212 0.9442812 11 Fold09
#> 85 0.9965212 0.9440158 13 Fold09
#> 86 0.9963631 0.9416101 15 Fold09
#> 87 0.9962049 0.9389263 17 Fold09
#> 88 0.9960468 0.9365327 19 Fold09
#> 89 0.9958887 0.9335201 21 Fold09
#> 90 0.9958887 0.9335201 23 Fold09
#> 91 0.9960468 0.9377164 5 Fold10
#> 92 0.9965212 0.9450623 7 Fold10
#> 93 0.9968374 0.9500567 9 Fold10
#> 94 0.9963631 0.9426991 11 Fold10
#> 95 0.9963631 0.9426991 13 Fold10
#> 96 0.9963631 0.9426991 15 Fold10
#> 97 0.9962049 0.9400680 17 Fold10
#> 98 0.9963631 0.9426991 19 Fold10
#> 99 0.9960468 0.9377164 21 Fold10
#> 100 0.9962049 0.9403468 23 Fold10
The average accuracy across ten folds when k = 7 is 99.16 with a standard deviation of 0.052. Based on the k folds accuracy performance, which exhibits variation in the thousands place, we can have confidence in the model performing well even when there is some variation in the data. However, if another data set is significantly different than the training data set there could be a considerable change in performance.
Random forest is a classification method that uses a large number of averaged decorrelated decision trees to assign a class to an observation. Decorrelation is accomplished by limiting the number of predictors the trees may use to split the data. To assign a class to a new observation, the observation is classified by all the trees in the model and assigned a final classification based on which class the largest number of trees assigned to that observation.
#pass
fitControl <- trainControl(method = "cv",
number = 10,
returnResamp = 'all',
savePredictions = 'final',
classProbs = TRUE)
set.seed(4)
tic("RF")
rf.fit <- train(Blue_Tarp_or_Not~Red+Green+Blue,
data = df_factor,
preProcess=c("center","scale"),
method="rf",
trControl= fitControl,
tuneLength=3
)
rf.time <- toc(quiet = TRUE)
rf.fit
#> note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
#>
#> Random Forest
#>
#> 63241 samples
#> 3 predictor
#> 2 classes: 'NBT', 'BT'
#>
#> Pre-processing: centered (3), scaled (3)
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 56917, 56917, 56917, 56916, 56917, 56917, ...
#> Resampling results across tuning parameters:
#>
#> mtry Accuracy Kappa
#> 2 0.9969798 0.9509141
#> 3 0.9969482 0.9504008
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 2.
The RF model was fit on 500 trees and the tuning parameter mtry=2 (number of variables chosen at each sample split) was selected. The difference in accuracy between mtry=2 and mtry=3 incredibly small (hundred thousandths place).
plot(rf.fit)
#pass
RF.prob <- predict(rf.fit, newdata=df_factor , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
par(pty="s")
RF_roc <- roc(df_factor $Blue_Tarp_or_Not, RF.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="RF ROC Curve")
An AUC of 99.4% indicates that our model is able to distinguish between the classes pretty well. It is likely that a threshold that can identify most of the training blue tarps without going over our budget can be chosen.
roc.info_rf <- roc(df_factor$Blue_Tarp_or_Not, RF.prob[,2], legacy.axes=TRUE)
roc.rf.df <- data.frame(tpp=roc.info_rf$sensitivities*100, fpp=(1-roc.info_rf$specificities)*100, thresholds=roc.info_rf$thresholds)
#roc.rf.df[roc.rf.df>99 & roc.rf.df < 100,]
#roc.rf.df
fig6 <- plot_ly(roc.rf.df, x=~tpp, y=~fpp, z=~thresholds, colors = "#0072B2") #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig6 <- fig6 %>% add_markers()
fig6 <- fig6 %>% layout(scene=list(xaxis=list(title="True Positive Rate"),
yaxis = list(title = 'False Positive Rate'),
zaxis = list(title = 'Threshold')))
fig6
RF.thresh <- 0.36
RF.pred_thresh <- factor(ifelse(RF.prob[,2]>RF.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.RF_thresh <- confusionMatrix(factor(RF.pred_thresh),df_factor $Blue_Tarp_or_Not, positive = "BT")
"Threshold:"
RF.thresh
cm.RF_thresh
acc_RF <- cm.RF_thresh[["overall"]][["Accuracy"]]*100
auc_RF <- RF_roc[["auc"]]
thresh_RF <- RF.thresh
sens_RF <- cm.RF_thresh[["byClass"]][["Sensitivity"]]*100
spec_RF <- cm.RF_thresh[["byClass"]][["Specificity"]]*100
FDR_RF <- ((cm.RF_thresh[["table"]][2,1])/(cm.RF_thresh[["table"]][2,1]+cm.RF_thresh[["table"]][2,2]))*100
prec_RF <- cm.RF_thresh[["byClass"]][["Precision"]]*100
mtry_best <- rf.fit[["bestTune"]][["mtry"]]
ntree <- rf.fit[["finalModel"]][["ntree"]]
F.rf <- round(2*((prec_RF*sens_RF)/(prec_RF+sens_RF))/100,2)
cost.RF <- round(cm.RF_thresh[["table"]][1]*0 + cm.RF_thresh[["table"]][2]*FP_C + cm.RF_thresh[["table"]][3]*FN_C + cm.RF_thresh[["table"]][4]*TP_C,4)
cost.RF
#> [1] "Threshold:"
#> [1] 0.36
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 61203 22
#> BT 16 2000
#>
#> Accuracy : 0.9994
#> 95% CI : (0.9992, 0.9996)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : <2e-16
#>
#> Kappa : 0.9903
#>
#> Mcnemar's Test P-Value : 0.4173
#>
#> Sensitivity : 0.98912
#> Specificity : 0.99974
#> Pos Pred Value : 0.99206
#> Neg Pred Value : 0.99964
#> Prevalence : 0.03197
#> Detection Rate : 0.03163
#> Detection Prevalence : 0.03188
#> Balanced Accuracy : 0.99443
#>
#> 'Positive' Class : BT
#>
#> [1] -1613612
"10 Fold Results"
rf.fit$resample #point est +/- std from 10 folds "variation in the third decimal place"...
rf.sd <- sd(rf.fit[["resample"]][["Accuracy"]]*100)
#> [1] "10 Fold Results"
#> Accuracy Kappa mtry Resample
#> 1 0.9971537 0.9528468 2 Fold01
#> 2 0.9973118 0.9555754 3 Fold01
#> 3 0.9973118 0.9560058 2 Fold02
#> 4 0.9974700 0.9588925 3 Fold02
#> 5 0.9963631 0.9416101 2 Fold03
#> 6 0.9965212 0.9442812 3 Fold03
#> 7 0.9966798 0.9453903 2 Fold04
#> 8 0.9968379 0.9476083 3 Fold04
#> 9 0.9971537 0.9533048 2 Fold05
#> 10 0.9963631 0.9401887 3 Fold05
#> 11 0.9977862 0.9643737 2 Fold06
#> 12 0.9976281 0.9617378 3 Fold06
#> 13 0.9969956 0.9505907 2 Fold07
#> 14 0.9971537 0.9530769 3 Fold07
#> 15 0.9971537 0.9537541 2 Fold08
#> 16 0.9973118 0.9564281 3 Fold08
#> 17 0.9968374 0.9491052 2 Fold09
#> 18 0.9966793 0.9464329 3 Fold09
#> 19 0.9963631 0.9421597 2 Fold10
#> 20 0.9962049 0.9397866 3 Fold10
The average accuracy across ten folds when mtry = 2 is 99.94 with a standard deviation of 0.045. Based on the k folds accuracy performance, which exhibits variation in the thousands place, we can have confidence in the model performing well even when there is some variation in the data. However, if another data set is significantly different than the training data set there could be a considerable change in performance.
Support vector machines or SVMs are a classification (or regression) technique that generates a hyperplane in N-dimensional space to separate the classes. This is accomplished by choosing the hyperplane that creates the largest margin between the classes. SVM is touted as one of the best classifiers when the data is numeric and continuous (which is not our use case though we will see SVm does a good job predicting on our data set) and the right kernel is used.
Support vectors are data points that are closest to the hyperplane and influence its position and orientation. They fall within the allowable margin on either side of the hyperplane. The SVM model produces a score between -1 and 1 for each observation. The sign of the score indicates which class the model assigned the observation to.
Based on the way caret has been programmed, the predict function creates probabilities instead of scores. This will allow a threshold to be chosen to generate confusion matrices.
#pass
fitControl <- trainControl(method = "cv",
number = 10,
returnResamp = 'all',
savePredictions = 'final',
classProbs = TRUE)
set.seed(4)
tic("SVM")
svm.radial.fit <- train(Blue_Tarp_or_Not~Red+Green+Blue,
data = df_factor,
preProcess=c("center","scale"),
method="svmRadial",
trControl= fitControl,
tuneLength=10
#tuneGrid = expand.grid(C=seq(0,10, length=10),
# sigma =seq(0,10, length=10))
)
svm.time <- toc(quiet = TRUE)
svm.radial.fit
"Summary"
summary(svm.radial.fit)
#> Support Vector Machines with Radial Basis Function Kernel
#>
#> 63241 samples
#> 3 predictor
#> 2 classes: 'NBT', 'BT'
#>
#> Pre-processing: centered (3), scaled (3)
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 56917, 56917, 56917, 56916, 56917, 56917, ...
#> Resampling results across tuning parameters:
#>
#> C Accuracy Kappa
#> 0.25 0.9967268 0.9463658
#> 0.50 0.9968849 0.9489059
#> 1.00 0.9969798 0.9505083
#> 2.00 0.9970589 0.9518827
#> 4.00 0.9971063 0.9526307
#> 8.00 0.9971854 0.9538882
#> 16.00 0.9972170 0.9545384
#> 32.00 0.9973909 0.9575077
#> 64.00 0.9973751 0.9574081
#> 128.00 0.9974384 0.9584802
#>
#> Tuning parameter 'sigma' was held constant at a value of 9.020118
#> Accuracy was used to select the optimal model using the largest value.
#> The final values used for the model were sigma = 9.020118 and C = 128.
#> [1] "Summary"
#> Length Class Mode
#> 1 ksvm S4
plot(svm.radial.fit)
Both linear and poly SVM functions were considered. Radial SVM produced the highest accuracy values, by less than a percentage point, of the three methods. SVM radial was chosen for building the SVM model.
The best tuning parameters selected for the radial model fit were sigma = 9.020118 and C = 128.
#pass
SVM.prob <- predict(svm.radial.fit, newdata=df_factor , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
par(pty="s")
SVM_roc <- roc(df_factor $Blue_Tarp_or_Not, SVM.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="SVM ROC Curve")
An AUC of 99.9% indicates that our model is able to distinguish between the classes almost perfectly. It should be possible to choose a threshold that can identify most of the training blue tarps without going over our budget.
roc.info_svm <- roc(df_factor$Blue_Tarp_or_Not, SVM.prob[,2], legacy.axes=TRUE)
roc.svm.df <- data.frame(tpp=roc.info_svm$sensitivities*100, fpp=(1-roc.info_svm$specificities)*100, thresholds=roc.info_svm$thresholds)
#roc.svm.df[roc.svm.df>99 & roc.svm.df < 100,]
#roc.svm.df
fig7 <- plot_ly(roc.svm.df, x=~tpp, y=~fpp, z=~thresholds, colors = "#0072B2") #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig7 <- fig7 %>% add_markers()
fig7 <- fig7 %>% layout(scene=list(xaxis=list(title="True Positive Rate"),
yaxis = list(title = 'False Positive Rate'),
zaxis = list(title = 'Threshold')))
fig7
SVM.thresh <- 0.0048
SVM.pred_thresh <- factor(ifelse(SVM.prob[,2]>SVM.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.SVM_thresh <- confusionMatrix(factor(SVM.pred_thresh),df_factor $Blue_Tarp_or_Not, positive = "BT")
"Threshold:"
SVM.thresh
cm.SVM_thresh
acc_SVM <- cm.SVM_thresh[["overall"]][["Accuracy"]]*100
auc_SVM <- SVM_roc[["auc"]]
thresh_SVM <- SVM.thresh
sens_SVM <- cm.SVM_thresh[["byClass"]][["Sensitivity"]]*100
spec_SVM <- cm.SVM_thresh[["byClass"]][["Specificity"]]*100
FDR_SVM <- ((cm.SVM_thresh[["table"]][2,1])/(cm.SVM_thresh[["table"]][2,1]+cm.SVM_thresh[["table"]][2,2]))*100
prec_SVM <- cm.SVM_thresh[["byClass"]][["Precision"]]*100
sigma_best <- round(svm.radial.fit[["bestTune"]][["sigma"]], 4)
C_best <- svm.radial.fit[["bestTune"]][["C"]]
F.svm <- round(2*((prec_SVM*sens_SVM)/(prec_SVM+sens_SVM))/100,2)
cost.SVM <- round(cm.SVM_thresh[["table"]][1]*0 + cm.SVM_thresh[["table"]][2]*FP_C + cm.SVM_thresh[["table"]][3]*FN_C + cm.SVM_thresh[["table"]][4]*TP_C, 4)
cost.SVM
#> [1] "Threshold:"
#> [1] 0.0048
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 60374 4
#> BT 845 2018
#>
#> Accuracy : 0.9866
#> 95% CI : (0.9856, 0.9875)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.8194
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.99802
#> Specificity : 0.98620
#> Pos Pred Value : 0.70486
#> Neg Pred Value : 0.99993
#> Prevalence : 0.03197
#> Detection Rate : 0.03191
#> Detection Prevalence : 0.04527
#> Balanced Accuracy : 0.99211
#>
#> 'Positive' Class : BT
#>
#> [1] -1786784
"10 Fold Results"
svm.radial.fit$resample #point est +/- std from 10 folds "variation in the third decimal place"...
svm.sd <- sd(svm.radial.fit[["resample"]][["Accuracy"]]*100)
#> [1] "10 Fold Results"
#> Accuracy Kappa sigma C Resample
#> 1 0.9966793 0.9448527 9.020118 0.25 Fold01
#> 2 0.9968374 0.9476075 9.020118 0.50 Fold01
#> 3 0.9973118 0.9555754 9.020118 1.00 Fold01
#> 4 0.9973118 0.9555754 9.020118 2.00 Fold01
#> 5 0.9974700 0.9580860 9.020118 4.00 Fold01
#> 6 0.9974700 0.9580860 9.020118 8.00 Fold01
#> 7 0.9974700 0.9580860 9.020118 16.00 Fold01
#> 8 0.9974700 0.9580860 9.020118 32.00 Fold01
#> 9 0.9974700 0.9582906 9.020118 64.00 Fold01
#> 10 0.9974700 0.9582906 9.020118 128.00 Fold01
#> 11 0.9966793 0.9456543 9.020118 0.25 Fold02
#> 12 0.9971537 0.9535305 9.020118 0.50 Fold02
#> 13 0.9973118 0.9560058 9.020118 1.00 Fold02
#> 14 0.9973118 0.9562180 9.020118 2.00 Fold02
#> 15 0.9971537 0.9535305 9.020118 4.00 Fold02
#> 16 0.9974700 0.9586938 9.020118 8.00 Fold02
#> 17 0.9974700 0.9586938 9.020118 16.00 Fold02
#> 18 0.9977862 0.9640309 9.020118 32.00 Fold02
#> 19 0.9977862 0.9643737 9.020118 64.00 Fold02
#> 20 0.9976281 0.9619196 9.020118 128.00 Fold02
#> 21 0.9966793 0.9453897 9.020118 0.25 Fold03
#> 22 0.9971537 0.9535305 9.020118 0.50 Fold03
#> 23 0.9974700 0.9588925 9.020118 1.00 Fold03
#> 24 0.9974700 0.9590893 9.020118 2.00 Fold03
#> 25 0.9974700 0.9590893 9.020118 4.00 Fold03
#> 26 0.9976281 0.9615542 9.020118 8.00 Fold03
#> 27 0.9973118 0.9566362 9.020118 16.00 Fold03
#> 28 0.9971537 0.9544119 9.020118 32.00 Fold03
#> 29 0.9969956 0.9517649 9.020118 64.00 Fold03
#> 30 0.9969956 0.9517649 9.020118 128.00 Fold03
#> 31 0.9963636 0.9410501 9.020118 0.25 Fold04
#> 32 0.9960474 0.9356152 9.020118 0.50 Fold04
#> 33 0.9963636 0.9407660 9.020118 1.00 Fold04
#> 34 0.9963636 0.9407660 9.020118 2.00 Fold04
#> 35 0.9963636 0.9407660 9.020118 4.00 Fold04
#> 36 0.9962055 0.9383392 9.020118 8.00 Fold04
#> 37 0.9965217 0.9437481 9.020118 16.00 Fold04
#> 38 0.9969960 0.9515348 9.020118 32.00 Fold04
#> 39 0.9973123 0.9568424 9.020118 64.00 Fold04
#> 40 0.9974704 0.9594774 9.020118 128.00 Fold04
#> 41 0.9971537 0.9530769 9.020118 0.25 Fold05
#> 42 0.9974700 0.9580860 9.020118 0.50 Fold05
#> 43 0.9971537 0.9530769 9.020118 1.00 Fold05
#> 44 0.9973118 0.9555754 9.020118 2.00 Fold05
#> 45 0.9973118 0.9553569 9.020118 4.00 Fold05
#> 46 0.9973118 0.9553569 9.020118 8.00 Fold05
#> 47 0.9974700 0.9580860 9.020118 16.00 Fold05
#> 48 0.9974700 0.9580860 9.020118 32.00 Fold05
#> 49 0.9974700 0.9582906 9.020118 64.00 Fold05
#> 50 0.9976281 0.9609926 9.020118 128.00 Fold05
#> 51 0.9966793 0.9448527 9.020118 0.25 Fold06
#> 52 0.9966793 0.9448527 9.020118 0.50 Fold06
#> 53 0.9968374 0.9478633 9.020118 1.00 Fold06
#> 54 0.9968374 0.9478633 9.020118 2.00 Fold06
#> 55 0.9971537 0.9530769 9.020118 4.00 Fold06
#> 56 0.9973118 0.9557916 9.020118 8.00 Fold06
#> 57 0.9973118 0.9557916 9.020118 16.00 Fold06
#> 58 0.9977862 0.9638571 9.020118 32.00 Fold06
#> 59 0.9976281 0.9611816 9.020118 64.00 Fold06
#> 60 0.9977862 0.9640309 9.020118 128.00 Fold06
#> 61 0.9969956 0.9505907 9.020118 0.25 Fold07
#> 62 0.9969956 0.9505907 9.020118 0.50 Fold07
#> 63 0.9968374 0.9476075 9.020118 1.00 Fold07
#> 64 0.9971537 0.9530769 9.020118 2.00 Fold07
#> 65 0.9971537 0.9530769 9.020118 4.00 Fold07
#> 66 0.9969956 0.9501048 9.020118 8.00 Fold07
#> 67 0.9973118 0.9555754 9.020118 16.00 Fold07
#> 68 0.9974700 0.9582906 9.020118 32.00 Fold07
#> 69 0.9974700 0.9584932 9.020118 64.00 Fold07
#> 70 0.9974700 0.9582906 9.020118 128.00 Fold07
#> 71 0.9968374 0.9486156 9.020118 0.25 Fold08
#> 72 0.9971537 0.9535305 9.020118 0.50 Fold08
#> 73 0.9974700 0.9586938 9.020118 1.00 Fold08
#> 74 0.9976281 0.9613688 9.020118 2.00 Fold08
#> 75 0.9977862 0.9640309 9.020118 4.00 Fold08
#> 76 0.9977862 0.9640309 9.020118 8.00 Fold08
#> 77 0.9977862 0.9640309 9.020118 16.00 Fold08
#> 78 0.9981025 0.9693170 9.020118 32.00 Fold08
#> 79 0.9977862 0.9643737 9.020118 64.00 Fold08
#> 80 0.9977862 0.9642031 9.020118 128.00 Fold08
#> 81 0.9962049 0.9380407 9.020118 0.25 Fold09
#> 82 0.9962049 0.9377398 9.020118 0.50 Fold09
#> 83 0.9962049 0.9377398 9.020118 1.00 Fold09
#> 84 0.9963631 0.9401887 9.020118 2.00 Fold09
#> 85 0.9963631 0.9401887 9.020118 4.00 Fold09
#> 86 0.9966793 0.9453897 9.020118 8.00 Fold09
#> 87 0.9966793 0.9453897 9.020118 16.00 Fold09
#> 88 0.9968374 0.9481165 9.020118 32.00 Fold09
#> 89 0.9971537 0.9535305 9.020118 64.00 Fold09
#> 90 0.9974700 0.9588925 9.020118 128.00 Fold09
#> 91 0.9969956 0.9515345 9.020118 0.25 Fold10
#> 92 0.9971537 0.9539755 9.020118 0.50 Fold10
#> 93 0.9968374 0.9488616 9.020118 1.00 Fold10
#> 94 0.9968374 0.9491052 9.020118 2.00 Fold10
#> 95 0.9968374 0.9491052 9.020118 4.00 Fold10
#> 96 0.9969956 0.9515345 9.020118 8.00 Fold10
#> 97 0.9968374 0.9493464 9.020118 16.00 Fold10
#> 98 0.9968374 0.9493464 9.020118 32.00 Fold10
#> 99 0.9966793 0.9469395 9.020118 64.00 Fold10
#> 100 0.9966793 0.9469395 9.020118 128.00 Fold10
The average accuracy across ten folds is 98.66 with a standard deviation of 0.045. Based on the k folds accuracy performance, which exhibits variation in the thousands place, we can have confidence in the model performing well even when there is some variation in the data. However, if another data set is significantly different than the training data set there could be a considerable change in performance.
df_FHO <- tibble(read.csv("SYS6018_FHO_Data.csv"))
df_FHO$Blue_Tarp_or_Not <- factor(df_FHO$Blue_Tarp_or_Not, labels = c("NBT", "BT"))
df_FHO
#> # A tibble: 2,004,177 x 4
#> Red Green Blue Blue_Tarp_or_Not
#> <int> <int> <int> <fct>
#> 1 104 89 63 NBT
#> 2 101 80 60 NBT
#> 3 103 87 69 NBT
#> 4 107 93 72 NBT
#> 5 109 99 68 NBT
#> 6 103 73 53 NBT
#> 7 100 79 56 NBT
#> 8 98 70 51 NBT
#> 9 97 73 56 NBT
#> 10 99 79 61 NBT
#> # ... with 2,004,167 more rows
tic("Log.Reg. Predict")
glm.prob_FHO <- predict(glm.fit, newdata=df_FHO , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
glm.predict.time <- toc(quiet = TRUE)
lr.thresh_FHO <- lr.thresh
glm.pred_thresh_FHO <- factor(ifelse(glm.prob_FHO[,2]>lr.thresh_FHO,"BT", "NBT"), levels=c("NBT", "BT"))
cm.glm_thresh_FHO <- confusionMatrix(factor(glm.pred_thresh_FHO),df_FHO$Blue_Tarp_or_Not, positive = "BT")
"Threshold:"
lr.thresh_FHO
cm.glm_thresh_FHO
acc_LR_FHO <- round(cm.glm_thresh_FHO[["overall"]][["Accuracy"]]*100, 2)
auc_LR_FHO <- round(glm_roc[["auc"]], 2)
thresh_LR_FHO <- lr.thresh_FHO
sens_LR_FHO <- round(cm.glm_thresh_FHO[["byClass"]][["Sensitivity"]]*100, 2)
spec_LR_FHO <- round(cm.glm_thresh_FHO[["byClass"]][["Specificity"]]*100, 2)
FDR_LR_FHO <- round(((cm.glm_thresh_FHO[["table"]][2,1])/(cm.glm_thresh_FHO[["table"]][2,1]+cm.glm_thresh_FHO[["table"]][2,2]))*100, 2)
prec_LR_FHO <- round(cm.glm_thresh_FHO[["byClass"]][["Precision"]]*100, 2)
F.glm_FHO <- round(2*((prec_LR_FHO*sens_LR_FHO)/(prec_LR_FHO+sens_LR_FHO))/100, 2)
#> [1] "Threshold:"
#> [1] 0.03265
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 1767335 0
#> BT 222362 14480
#>
#> Accuracy : 0.8891
#> 95% CI : (0.8886, 0.8895)
#> No Information Rate : 0.9928
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.103
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Sensitivity : 1.000000
#> Specificity : 0.888243
#> Pos Pred Value : 0.061138
#> Neg Pred Value : 1.000000
#> Prevalence : 0.007225
#> Detection Rate : 0.007225
#> Detection Prevalence : 0.118174
#> Balanced Accuracy : 0.944122
#>
#> 'Positive' Class : BT
#>
par(pty="s")
glm_roc_FHO <- roc(df_FHO$Blue_Tarp_or_Not, glm.prob_FHO[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="Log. Reg. FHO ROC Curve")
auc_LR_FHO <- round(glm_roc_FHO[["auc"]], 2)
tic("LDA Predict")
lda.prob_FHO <- predict(lda.fit, newdata=df_FHO, type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
lda.predict.time <- toc(quiet = TRUE)
lda.thresh_FHO <- lda.thresh
lda.pred_thresh_FHO <- factor(ifelse(lda.prob_FHO[,2]>lda.thresh_FHO,"BT", "NBT"), levels=c("NBT", "BT"))
cm.lda_thresh_FHO <- confusionMatrix(factor(lda.pred_thresh_FHO),df_FHO$Blue_Tarp_or_Not, positive = "BT")
"Threshold:"
lda.thresh_FHO
cm.lda_thresh_FHO
acc_lda_FHO <- round(cm.lda_thresh_FHO[["overall"]][["Accuracy"]]*100, 2)
thresh_lda_FHO <- lda.thresh_FHO
sens_lda_FHO <- round(cm.lda_thresh_FHO[["byClass"]][["Sensitivity"]]*100,2)
spec_lda_FHO <- round(cm.lda_thresh_FHO[["byClass"]][["Specificity"]]*100,2)
FDR_lda_FHO <- round(((cm.lda_thresh_FHO[["table"]][2,1])/(cm.lda_thresh_FHO[["table"]][2,1]+cm.lda_thresh_FHO[["table"]][2,2]))*100,2)
prec_lda_FHO <- round(cm.lda_thresh_FHO[["byClass"]][["Precision"]]*100, 2)
F.lda_FHO <- round(2*((prec_lda_FHO*sens_lda_FHO)/(prec_lda_FHO+sens_lda_FHO))/100, 2)
#> [1] "Threshold:"
#> [1] 0.01
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 1926064 620
#> BT 63633 13860
#>
#> Accuracy : 0.9679
#> 95% CI : (0.9677, 0.9682)
#> No Information Rate : 0.9928
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.2928
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Sensitivity : 0.957182
#> Specificity : 0.968019
#> Pos Pred Value : 0.178855
#> Neg Pred Value : 0.999678
#> Prevalence : 0.007225
#> Detection Rate : 0.006916
#> Detection Prevalence : 0.038666
#> Balanced Accuracy : 0.962601
#>
#> 'Positive' Class : BT
#>
par(pty="s")
lda_roc_FHO <- roc(df_FHO$Blue_Tarp_or_Not, lda.prob_FHO[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="LDA FHO ROC Curve")
auc_lda_FHO <- round(lda_roc_FHO[["auc"]],2)
tic("QDA Predict")
qda.prob_FHO <- predict(qda.fit, newdata=df_FHO , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
qda.predict.time <- toc(quiet = TRUE)
qda.thresh_FHO <- qda.thresh
qda.pred_thresh_FHO <- factor(ifelse(qda.prob_FHO[,2]>qda.thresh_FHO,"BT", "NBT"), levels=c("NBT", "BT"))
cm.qda_thresh_FHO <- confusionMatrix(factor(qda.pred_thresh_FHO),df_FHO$Blue_Tarp_or_Not, positive = "BT")
"Threshold:"
qda.thresh_FHO
cm.qda_thresh_FHO
acc_qda_FHO <- round(cm.qda_thresh_FHO[["overall"]][["Accuracy"]]*100,2)
thresh_qda_FHO <- qda.thresh_FHO
sens_qda_FHO <- round(cm.qda_thresh_FHO[["byClass"]][["Sensitivity"]]*100,2)
spec_qda_FHO <- round(cm.qda_thresh_FHO[["byClass"]][["Specificity"]]*100,2)
FDR_qda_FHO <- round(((cm.qda_thresh_FHO[["table"]][2,1])/(cm.qda_thresh_FHO[["table"]][2,1]+cm.qda_thresh_FHO[["table"]][2,2]))*100,2)
prec_qda_FHO <- round(cm.qda_thresh_FHO[["byClass"]][["Precision"]]*100,2)
F.qda_FHO <- round(2*((prec_qda_FHO*sens_qda_FHO)/(prec_qda_FHO+sens_qda_FHO))/100,2)
#> [1] "Threshold:"
#> [1] 0.02
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 1925760 1103
#> BT 63937 13377
#>
#> Accuracy : 0.9675
#> 95% CI : (0.9673, 0.9678)
#> No Information Rate : 0.9928
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.2827
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Sensitivity : 0.923826
#> Specificity : 0.967866
#> Pos Pred Value : 0.173022
#> Neg Pred Value : 0.999428
#> Prevalence : 0.007225
#> Detection Rate : 0.006675
#> Detection Prevalence : 0.038576
#> Balanced Accuracy : 0.945846
#>
#> 'Positive' Class : BT
#>
par(pty="s")
qda_roc_FHO <- roc(df_FHO$Blue_Tarp_or_Not, qda.prob_FHO[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="QDA FHO ROC Curve")
auc_qda_FHO <- round(qda_roc_FHO[["auc"]],2)
tic("KNN Predict")
knn.prob_FHO <- predict(knn.fit, newdata=df_FHO , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
knn.predict.time <- toc(quiet = TRUE)
knn.thresh_FHO <- knn.thresh
knn.pred_thresh_FHO <- factor(ifelse(knn.prob_FHO[,2]>knn.thresh_FHO,"BT", "NBT"), levels=c("NBT", "BT"))
cm.knn_thresh_FHO <- confusionMatrix(factor(knn.pred_thresh_FHO),df_FHO$Blue_Tarp_or_Not, positive = "BT")
"Threshold:"
knn.thresh_FHO
cm.knn_thresh_FHO
acc_knn_FHO <- round(cm.knn_thresh_FHO[["overall"]][["Accuracy"]]*100,2)
thresh_knn_FHO <- knn.thresh_FHO
sens_knn_FHO <- round(cm.knn_thresh_FHO[["byClass"]][["Sensitivity"]]*100,2)
spec_knn_FHO <- round(cm.knn_thresh_FHO[["byClass"]][["Specificity"]]*100,2)
FDR_knn_FHO <- round(((cm.knn_thresh_FHO[["table"]][2,1])/(cm.knn_thresh_FHO[["table"]][2,1]+cm.knn_thresh_FHO[["table"]][2,2]))*100,2)
prec_knn_FHO <- round(cm.knn_thresh_FHO[["byClass"]][["Precision"]]*100,2)
F.knn_FHO <- round(2*((prec_knn_FHO*sens_knn_FHO)/(prec_knn_FHO+sens_knn_FHO))/100,2)
#> [1] "Threshold:"
#> [1] 0.07
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 1941216 988
#> BT 48481 13492
#>
#> Accuracy : 0.9753
#> 95% CI : (0.9751, 0.9755)
#> No Information Rate : 0.9928
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.3453
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Sensitivity : 0.931768
#> Specificity : 0.975634
#> Pos Pred Value : 0.217708
#> Neg Pred Value : 0.999491
#> Prevalence : 0.007225
#> Detection Rate : 0.006732
#> Detection Prevalence : 0.030922
#> Balanced Accuracy : 0.953701
#>
#> 'Positive' Class : BT
#>
par(pty="s")
knn_roc_FHO <- roc(df_FHO$Blue_Tarp_or_Not, knn.prob_FHO[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="KNN FHO ROC Curve")
auc_knn_FHO <- round(knn_roc_FHO[["auc"]],2)
tic("RF Predict")
RF.prob_FHO <- predict(rf.fit, newdata=df_FHO , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
rf.predict.time <- toc(quiet = TRUE)
RF.thresh_FHO <- RF.thresh
RF.pred_thresh_FHO <- factor(ifelse(RF.prob_FHO[,2]>RF.thresh_FHO,"BT", "NBT"), levels=c("NBT", "BT"))
cm.RF_thresh_FHO <- confusionMatrix(factor(RF.pred_thresh_FHO),df_FHO$Blue_Tarp_or_Not, positive = "BT")
"Threshold:"
RF.thresh_FHO
cm.RF_thresh_FHO
acc_RF_FHO <- round(cm.RF_thresh_FHO[["overall"]][["Accuracy"]]*100,2)
thresh_RF_FHO <- RF.thresh
sens_RF_FHO <- round(cm.RF_thresh_FHO[["byClass"]][["Sensitivity"]]*100,2)
spec_RF_FHO <- round(cm.RF_thresh_FHO[["byClass"]][["Specificity"]]*100,2)
FDR_RF_FHO <- round(((cm.RF_thresh_FHO[["table"]][2,1])/(cm.RF_thresh_FHO[["table"]][2,1]+cm.RF_thresh_FHO[["table"]][2,2]))*100,2)
prec_RF_FHO <- round(cm.RF_thresh_FHO[["byClass"]][["Precision"]]*100,2)
F.rf_FHO <- round(2*((prec_RF_FHO*sens_RF_FHO)/(prec_RF_FHO+sens_RF_FHO))/100,2)
#> [1] "Threshold:"
#> [1] 0.36
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 1974225 2387
#> BT 15472 12093
#>
#> Accuracy : 0.9911
#> 95% CI : (0.991, 0.9912)
#> No Information Rate : 0.9928
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.5712
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Sensitivity : 0.835152
#> Specificity : 0.992224
#> Pos Pred Value : 0.438709
#> Neg Pred Value : 0.998792
#> Prevalence : 0.007225
#> Detection Rate : 0.006034
#> Detection Prevalence : 0.013754
#> Balanced Accuracy : 0.913688
#>
#> 'Positive' Class : BT
#>
par(pty="s")
rf_roc_FHO <- roc(df_FHO$Blue_Tarp_or_Not, RF.prob_FHO[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="RF FHO ROC Curve")
auc_RF_FHO <- round(rf_roc_FHO[["auc"]],2)
tic("SVM Predict")
SVM.prob_FHO <- predict(svm.radial.fit, newdata=df_FHO , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
svm.predict.time <- toc(quiet = TRUE)
SVM.thresh_FHO <- SVM.thresh
SVM.pred_thresh_FHO <- factor(ifelse(SVM.prob_FHO[,2]>SVM.thresh_FHO,"BT", "NBT"), levels=c("NBT", "BT"))
cm.SVM_thresh_FHO <- confusionMatrix(factor(SVM.pred_thresh_FHO),df_FHO$Blue_Tarp_or_Not, positive = "BT")
"Threshold:"
SVM.thresh_FHO
cm.SVM_thresh_FHO
acc_SVM_FHO <- round(cm.SVM_thresh_FHO[["overall"]][["Accuracy"]]*100,2)
thresh_SVM_FHO <- SVM.thresh
sens_SVM_FHO <- round(cm.SVM_thresh_FHO[["byClass"]][["Sensitivity"]]*100,2)
spec_SVM_FHO <- round(cm.SVM_thresh_FHO[["byClass"]][["Specificity"]]*100,2)
FDR_SVM_FHO <- round(((cm.SVM_thresh_FHO[["table"]][2,1])/(cm.SVM_thresh_FHO[["table"]][2,1]+cm.SVM_thresh_FHO[["table"]][2,2]))*100,2)
prec_SVM_FHO <- round(cm.SVM_thresh_FHO[["byClass"]][["Precision"]]*100,2)
F.svm_FHO <- round(2*((prec_SVM_FHO*sens_SVM_FHO)/(prec_SVM_FHO+sens_SVM_FHO))/100,2)
#> [1] "Threshold:"
#> [1] 0.0048
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 1929956 2627
#> BT 59741 11853
#>
#> Accuracy : 0.9689
#> 95% CI : (0.9686, 0.9691)
#> No Information Rate : 0.9928
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.2666
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Sensitivity : 0.818577
#> Specificity : 0.969975
#> Pos Pred Value : 0.165559
#> Neg Pred Value : 0.998641
#> Prevalence : 0.007225
#> Detection Rate : 0.005914
#> Detection Prevalence : 0.035722
#> Balanced Accuracy : 0.894276
#>
#> 'Positive' Class : BT
#>
par(pty="s")
svm_roc_FHO <- roc(df_FHO$Blue_Tarp_or_Not, SVM.prob_FHO[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="SVM FHO ROC Curve")
auc_SVM_FHO <- round(svm_roc_FHO[["auc"]],2)
| Method | KNN (k = 7) | LDA | QDA | Log. Regression | Random Forest (mtry = 2, nTrees = 500) | SVM (sigma = 9.0201, C = 128) |
|---|---|---|---|---|---|---|
| Accuracy | 99.16% | 97.62% | 97.75% | 98.28% | 99.94% | 98.66% |
| AUC | 99.98% | 98.89% | 99.82% | 99.85% | 99.45% | 99.92% |
| ROC | ||||||
| Threshold | 0.07 | 0.01 | 0.02 | 0.03265 | 0.36 | 0.0048 |
| Sensitivity | 100% | 88.58% | 99.31% | 98.02% | 98.91% | 99.8% |
| Specificity | 99.13% | 97.92% | 97.69% | 98.29% | 99.97% | 98.62% |
| FDR | 20.89% | 41.59% | 41.29% | 34.59% | 0.79% | 29.51% |
| Precision | 79.11% | 58.41% | 58.71% | 65.41% | 99.21% | 70.49% |
| Cost | -1.676710^{6} | -2.86827610^{6} | -1.99884410^{6} | -1.9987410^{6} | -1.61361210^{6} | -1.78678410^{6} |
| F Score | 0.88 | 0.7 | 0.74 | 0.78 | 0.99 | 0.83 |
| Method | KNN (k = 7) | LDA | QDA | Log. Regression | Random Forest (mtry = 2, nTrees = 500) | SVM (sigma = 9.0201, C = 128) |
|---|---|---|---|---|---|---|
| Accuracy | 97.53% | 96.79% | 96.75% | 88.91% | 99.11% | 96.89% |
| AUC | 96.27% | 99.21% | 99.15% | 99.94% | 98.04% | 94.12% |
| ROC | ||||||
| Threshold | 0.07 | 0.01 | 0.02 | 0.03265 | 0.36 | 0.0048 |
| Sensitivity=Recall=Power | 93.18% | 95.72% | 92.38% | 100% | 83.52% | 81.86% |
| Specificity=1-FPR | 97.56% | 96.8% | 96.79% | 88.82% | 99.22% | 97% |
| FDR | 78.23% | 82.11% | 82.7% | 93.89% | 56.13% | 83.44% |
| Precision=PPV | 21.77% | 17.89% | 17.3% | 6.11% | 43.87% | 16.56% |
| F Score | 0.35 | 0.3 | 0.29 | 0.12 | 0.58 | 0.28 |
Selection and Interpretation of Tuning Parameters:
Three models - KNN, RF, and SVM - had tuning parameters.
KNN’s tuning parameter was k which represents the number of observations that the model is told to use to determine which class a new observation is assigned to. My code specified a tuneLength of 10, which resulted a model being fit for ten different k values. The following values were tried for k: 5, 7, 9, 11, 13, 15, 17, 19, 21, 23. The k with the highest accuracy, k=7, was selected as the best fit and used for the final mode.
RF’s tuning parameter was mtry. This parameter represents the number of variables chosen at each sample split as the trees are created. Due to there only being three predictors, values of mtry=2 and mtry = 3 were considered. The model fit with mtry=2 produced a higher accuracy than mtry=3 and was selected as the best fit. It should be noted that difference in accuracy between mtry=2 and mtry=3 incredibly small (hundred thousandths place). I would also like to comment that the parameter ntrees, the number of trees built to fit the model, can also be specified for the model. Caret automatically chose 500 trees. Based on the high accuracy produced by 500 trees I did not investigate this parameter further.
SVM has two tuning parameters C, the cost penalty (which ultimately controls the margin size), and sigma, a scaling factor for the exponential behavior (which referred to as gamma in our textbook). More specifically, this sigma parameter is called the “inverse width parameter in the Gaussian Radial Basis kernel”. The larger your C value the bigger the penalty for misclassifications. This results in fewer support vectors and a smaller margin. Small margins result in lower bias and higher variance. With caret a tuneLength = 10 was run. Ten values for C were fit on different models. Trying to explain how sigma was chosen, I realized only one sigma value was used when tuneLength is called. This thread indicates that svmRadial chooses a single sigma based on the output from sigest a function from the kernlab library. Per the function’s description, sigest returns a range of values for sigma which will return good results. I assume svmRadial chooses one of those sigma values, on what basis this decision is made I’m not sure. If I had realized this sooner, I could have used svmRadialSigma which tunes over both cost and sigma.
Returning to the explanation of what parameters were chosen, C = 128 and sigma = 9.020118. These were the parameters that created the best fit of the model from the 10 options provided by tuneLength. If I had more time to play with the project I would investigate the difference in performance between a model fit with C = 32 and C = 128 on the FHO data. There’s only 0.0000475% difference in the accuracy of models fit with those C values. A smaller C would result in a larger margin. It is possible that with a wider margin there could be better performance in predicting the FHO data.
1. A discussion of the best performing algorithm(s) in the cross-validation and hold-out data
We can see from Table 2, that the random forest model exhibited the highest accuracy of all six models on the training (cross-validation) data. This is also true for the performance of the RF model on the final hold out data.
With the goal of delivering supplies to the greatest number of people without spending too many resources pursuing false positives, (discussion on FHO data why we do this… what the benefits are… potential pitfalls)
2. A discussion or analysis justifying why your findings above are compatible or reconcilable
Why do we see variation like we do… Linear discriminant analysis or (LDA) is another classification technique that models the distribution of the predictors for each class separately uses a linear decision boundary to
From ISLR, reasons to consider the use of LDA include:
3. A recommendation and rationale regarding which algorithm to use for detection of blue tarps
Does the method need to reference the full data set in order to make predictions?
mfrt.glm <- round((glm.time$toc-glm.time$tic), 2)
mfrt.lda <- round((lda.time$toc-lda.time$tic), 2)
mfrt.qda <- round((qda.time$toc-qda.time$tic), 2)
mfrt.knn <- round((knn.time$toc-knn.time$tic), 2)
mfrt.rf <- round((rf.time$toc-rf.time$tic), 2)
mfrt.svm <- round((svm.time$toc-svm.time$tic), 2)
Model Fitting
| Model | Run Time (s) |
|---|---|
| Log. Reg. | 18.39 |
| LDA | 2 |
| QDA | 2.39 |
| KNN | 57.52 |
| RF | 94.42 |
| SVM | 228.74 |
mprt.glm <- round((glm.predict.time$toc-glm.predict.time$tic), 2)
mprt.lda <- round((lda.predict.time$toc-lda.predict.time$tic),2)
mprt.qda <- round((qda.predict.time$toc-qda.predict.time$tic) ,2)
mprt.knn <- round((knn.predict.time$toc-knn.predict.time$tic) ,2)
mprt.rf <- round((rf.predict.time$toc-rf.predict.time$tic),2)
mprt.svm <- round((svm.predict.time$toc-svm.predict.time$tic),2)
Predicting on the FHO
| Model | Run Time (s) |
|---|---|
| Log. Reg. | 7.99 |
| LDA | 9.36 |
| QDA | 7.83 |
| KNN | 445.26 |
| RF | 21.69 |
| SVM | 139.74 |
Run time variation between devices, qualitative measure only. Just make note of computer specs
Computer Specs.
#>
#> Call:
#> roc.default(response = df_factor$Blue_Tarp_or_Not, predictor = glm.prob[, 2], percent = TRUE, plot = TRUE, legacy.axes = TRUE, xlab = "False Positive Percentage", ylab = "True Positive Percentage", col = cbPalette[1], lwd = 3, print.auc = TRUE, print.auc.y = 30, print.auc.x = 60, main = "Training ROC Curve")
#>
#> Data: glm.prob[, 2] in 61219 controls (df_factor$Blue_Tarp_or_Not NBT) < 2022 cases (df_factor$Blue_Tarp_or_Not BT).
#> Area under the curve: 99.85%
#>
#> Call:
#> roc.default(response = df_factor$Blue_Tarp_or_Not, predictor = lda.prob[, 2], percent = TRUE, plot = TRUE, legacy.axes = TRUE, xlab = "False Positive Percentage", ylab = "True Positive Percentage", col = cbPalette[2], lwd = 3, print.auc = TRUE, print.auc.y = 25, print.auc.x = 60, add = TRUE)
#>
#> Data: lda.prob[, 2] in 61219 controls (df_factor$Blue_Tarp_or_Not NBT) < 2022 cases (df_factor$Blue_Tarp_or_Not BT).
#> Area under the curve: 98.89%
#>
#> Call:
#> roc.default(response = df_factor$Blue_Tarp_or_Not, predictor = qda.prob[, 2], percent = TRUE, plot = TRUE, legacy.axes = TRUE, xlab = "False Positive Percentage", ylab = "True Positive Percentage", col = cbPalette[3], lwd = 3, print.auc = TRUE, print.auc.y = 20, print.auc.x = 60, add = TRUE)
#>
#> Data: qda.prob[, 2] in 61219 controls (df_factor$Blue_Tarp_or_Not NBT) < 2022 cases (df_factor$Blue_Tarp_or_Not BT).
#> Area under the curve: 99.82%
#>
#> Call:
#> roc.default(response = df_factor$Blue_Tarp_or_Not, predictor = knn.prob[, 2], percent = TRUE, plot = TRUE, legacy.axes = TRUE, xlab = "False Positive Percentage", ylab = "True Positive Percentage", col = cbPalette[4], lwd = 3, print.auc = TRUE, print.auc.y = 15, print.auc.x = 60, add = TRUE)
#>
#> Data: knn.prob[, 2] in 61219 controls (df_factor$Blue_Tarp_or_Not NBT) < 2022 cases (df_factor$Blue_Tarp_or_Not BT).
#> Area under the curve: 99.98%
#>
#> Call:
#> roc.default(response = df_factor$Blue_Tarp_or_Not, predictor = RF.prob[, 2], percent = TRUE, plot = TRUE, legacy.axes = TRUE, xlab = "False Positive Percentage", ylab = "True Positive Percentage", col = cbPalette[5], lwd = 3, print.auc = TRUE, print.auc.y = 10, print.auc.x = 60, add = TRUE)
#>
#> Data: RF.prob[, 2] in 61219 controls (df_factor$Blue_Tarp_or_Not NBT) < 2022 cases (df_factor$Blue_Tarp_or_Not BT).
#> Area under the curve: 99.45%
#>
#> Call:
#> roc.default(response = df_factor$Blue_Tarp_or_Not, predictor = SVM.prob[, 2], percent = TRUE, plot = TRUE, legacy.axes = TRUE, xlab = "False Positive Percentage", ylab = "True Positive Percentage", col = cbPalette[6], lwd = 3, print.auc = TRUE, print.auc.y = 5, print.auc.x = 60, add = TRUE)
#>
#> Data: SVM.prob[, 2] in 61219 controls (df_factor$Blue_Tarp_or_Not NBT) < 2022 cases (df_factor$Blue_Tarp_or_Not BT).
#> Area under the curve: 99.92%
#>
#> Call:
#> roc.default(response = df_FHO$Blue_Tarp_or_Not, predictor = glm.prob_FHO[, 2], percent = TRUE, plot = TRUE, legacy.axes = TRUE, xlab = "False Positive Percentage", ylab = "True Positive Percentage", col = cbPalette[1], lwd = 3, print.auc = TRUE, print.auc.y = 30, print.auc.x = 60, main = " FHO ROC Curve")
#>
#> Data: glm.prob_FHO[, 2] in 1989697 controls (df_FHO$Blue_Tarp_or_Not NBT) < 14480 cases (df_FHO$Blue_Tarp_or_Not BT).
#> Area under the curve: 99.94%
#>
#> Call:
#> roc.default(response = df_FHO$Blue_Tarp_or_Not, predictor = lda.prob_FHO[, 2], percent = TRUE, plot = TRUE, legacy.axes = TRUE, xlab = "False Positive Percentage", ylab = "True Positive Percentage", col = cbPalette[2], lwd = 3, print.auc = TRUE, print.auc.y = 25, print.auc.x = 60, add = TRUE)
#>
#> Data: lda.prob_FHO[, 2] in 1989697 controls (df_FHO$Blue_Tarp_or_Not NBT) < 14480 cases (df_FHO$Blue_Tarp_or_Not BT).
#> Area under the curve: 99.21%
#>
#> Call:
#> roc.default(response = df_FHO$Blue_Tarp_or_Not, predictor = qda.prob_FHO[, 2], percent = TRUE, plot = TRUE, legacy.axes = TRUE, xlab = "False Positive Percentage", ylab = "True Positive Percentage", col = cbPalette[3], lwd = 3, print.auc = TRUE, print.auc.y = 20, print.auc.x = 60, add = TRUE)
#>
#> Data: qda.prob_FHO[, 2] in 1989697 controls (df_FHO$Blue_Tarp_or_Not NBT) < 14480 cases (df_FHO$Blue_Tarp_or_Not BT).
#> Area under the curve: 99.15%
#>
#> Call:
#> roc.default(response = df_FHO$Blue_Tarp_or_Not, predictor = knn.prob_FHO[, 2], percent = TRUE, plot = TRUE, legacy.axes = TRUE, xlab = "False Positive Percentage", ylab = "True Positive Percentage", col = cbPalette[4], lwd = 3, print.auc = TRUE, print.auc.y = 15, print.auc.x = 60, add = TRUE)
#>
#> Data: knn.prob_FHO[, 2] in 1989697 controls (df_FHO$Blue_Tarp_or_Not NBT) < 14480 cases (df_FHO$Blue_Tarp_or_Not BT).
#> Area under the curve: 96.27%
#>
#> Call:
#> roc.default(response = df_FHO$Blue_Tarp_or_Not, predictor = RF.prob_FHO[, 2], percent = TRUE, plot = TRUE, legacy.axes = TRUE, xlab = "False Positive Percentage", ylab = "True Positive Percentage", col = cbPalette[5], lwd = 3, print.auc = TRUE, print.auc.y = 10, print.auc.x = 60, add = TRUE)
#>
#> Data: RF.prob_FHO[, 2] in 1989697 controls (df_FHO$Blue_Tarp_or_Not NBT) < 14480 cases (df_FHO$Blue_Tarp_or_Not BT).
#> Area under the curve: 98.04%
#>
#> Call:
#> roc.default(response = df_FHO$Blue_Tarp_or_Not, predictor = SVM.prob_FHO[, 2], percent = TRUE, plot = TRUE, legacy.axes = TRUE, xlab = "False Positive Percentage", ylab = "True Positive Percentage", col = cbPalette[6], lwd = 3, print.auc = TRUE, print.auc.y = 5, print.auc.x = 60, add = TRUE)
#>
#> Data: SVM.prob_FHO[, 2] in 1989697 controls (df_FHO$Blue_Tarp_or_Not NBT) < 14480 cases (df_FHO$Blue_Tarp_or_Not BT).
#> Area under the curve: 94.12%
4. A discussion of the relevance of the metrics calculated in the tables to this application context
precision and recall. How accuracy doesn’t necessarily serve our context well.
5. Discussion of budget and threshold selection
6. Future Work
Lidar
A sincere thank you to Professor Schwartz and his excellent instruction this semester. In addition, I’m very grateful to my cohort. They inspired much of my learning and helped to reinforce the rest.
#> Run Time: 1104.85 sec elapsed